control point
On the Construction and Implications of Low-Loss Valleys in LoRA-based Bayesian Inference
Dold, Daniel; Sommer, Emanuel; Kobialka, Julius; Dürr, Oliver; Rügamer, David
While parameter-efficient fine-tuning methods like low-rank adaptation (LoRA) are standard for large language models, principled estimation of epistemic uncertainty remains challenging. Recent results in the LoRA regime suggest that discrete multi-mode approaches such as deep ensembles offer little benefit over single-mode methods. This contradicts broader observations in deep learning, where ensembling independent optima typically improves generalization, and linking these modes through continuous low-loss valleys further enhances Bayesian model averaging (BMA). Whether such structure exists in the LoRA space and whether it yields functional diversity missed by local or discrete methods has not been studied. We introduce LoRA-Curve, a segmented Bézier curve parameterization in the LoRA space, with two variants: a free configuration that jointly optimizes all control points, and an anchored configuration that connects independently fine-tuned LoRA optima. We prove pathwise continuity and Lipschitz regularity of the loss along the curve and empirically show, across reasoning and classification benchmarks with Qwen2.5 7B, that linear interpolation encounters loss barriers, while our anchored multi-segment curves connect independent optima through continuous low-loss valleys. Combined with flat-minima perturbations and a Jensen-Shannon divergence regularizer, LoRA-Curve yields measurably higher mutual information of the predictive distribution without sacrificing performance, and links continuous parameter-space traversal to functional diversity.
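The segmented Bézier parameterization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `bezier_point` and `segmented_bezier`, the quadratic segment degree, the toy 3-dimensional vectors, and the treatment of LoRA parameters as one flattened vector are all assumptions made for illustration. The three anchors stand in for independently fine-tuned LoRA optima (the anchored configuration); the interior control points would be the trainable quantities.

```python
from math import comb

import numpy as np


def bezier_point(control_points, t):
    """Evaluate one Bezier segment at local parameter t in [0, 1]."""
    pts = np.asarray(control_points, dtype=float)
    n = len(pts) - 1  # curve degree
    # Bernstein basis weights: C(n, k) * t^k * (1 - t)^(n - k)
    weights = np.array(
        [comb(n, k) * t**k * (1.0 - t) ** (n - k) for k in range(n + 1)]
    )
    return weights @ pts


def segmented_bezier(segments, t):
    """Evaluate a chain of Bezier segments at global parameter t in [0, 1].

    Consecutive segments are assumed to share an endpoint, so the
    curve is continuous where they join.
    """
    m = len(segments)
    scaled = min(max(t, 0.0), 1.0) * m
    idx = min(int(scaled), m - 1)  # which segment t falls into
    return bezier_point(segments[idx], scaled - idx)


# Hypothetical flattened LoRA parameter vectors (toy dimension 3).
anchor_a = np.array([0.0, 0.0, 0.0])
anchor_b = np.array([2.0, 0.0, 1.0])
anchor_c = np.array([4.0, 2.0, 0.0])
# Interior control points: trainable in practice, fixed here for the sketch.
bend_ab = np.array([1.0, 1.0, 0.0])
bend_bc = np.array([3.0, 0.0, 2.0])

segments = [
    [anchor_a, bend_ab, anchor_b],
    [anchor_b, bend_bc, anchor_c],
]

start = segmented_bezier(segments, 0.0)
joint = segmented_bezier(segments, 0.5)
end = segmented_bezier(segments, 1.0)
quarter = segmented_bezier(segments, 0.25)  # midpoint of the first segment
```

With the interior control points fixed at the segment midpoints, each segment reduces to linear interpolation between its anchors, which is the baseline the abstract reports as encountering loss barriers.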
Results
In this section we prove the theoretical results around the dual curriculum game and use these results to show approximation bounds for our methods, given that they have reached a Nash equilibrium (NE). The first theorem is the main result that allows us to analyze dual curriculum games. At a high level, it says that the NE of a dual curriculum game are approximate NE of the base game from the perspective of any of the individual players, or from the perspective of the joint strategy. Let $B$ be the maximum difference between $U_t^1$ and $U_t^2$, and let $(\pi, \theta^1, \theta^2)$ be a NE for $G$. Then $(\pi, p\theta^1 + (1-p)\theta^2)$ is an approximate NE for the base game with either teacher or for a teacher optimizing their joint objective. More precisely, it is a $2Bp(1-p)$-approximate NE when $U_t = pU_t^1 + (1-p)U_t^2$, a $2B(1-p)$-approximate NE when $U_t = U_t^1$, and a $2Bp$-approximate NE when $U_t = U_t^2$. At a high level, this is true because, for low values of $p$, the best-response strategies for the individual players can be thought of as approximate best-response strategies for the joint player, and vice versa. Since the NE consists of each of the players playing their own best response, they must be playing an approximate best response for the joint player. We provide a formal proof below: Proof. Let $B$ be the maximum difference between $U_t^1$ and $U_t^2$, and let $(\pi, \theta^1, \theta^2)$ be a NE for $G$. Then consider $p\theta^1 + (1-p)\theta^2$ as a strategy in the base game for the joint player $pU_t^1 + (1-p)U_t^2$.
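The three approximation gaps stated in the theorem are simple functions of B and p, so a small numeric sketch may help when reading them. The helper names `approx_ne_bounds` and `mix_teachers`, the dictionary keys, and the toy two-level distributions are hypothetical; the code only evaluates the stated expressions and the mixture of the two teacher strategies, and does not verify the theorem.

```python
import numpy as np


def approx_ne_bounds(B, p):
    """Approximation gaps from the theorem for utility gap B and mixing weight p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return {
        "joint": 2.0 * B * p * (1.0 - p),  # U_t = p U_t^1 + (1 - p) U_t^2
        "teacher_1": 2.0 * B * (1.0 - p),  # U_t = U_t^1
        "teacher_2": 2.0 * B * p,          # U_t = U_t^2
    }


def mix_teachers(theta_1, theta_2, p):
    """Mixed teacher strategy p * theta_1 + (1 - p) * theta_2 over levels."""
    theta_1 = np.asarray(theta_1, dtype=float)
    theta_2 = np.asarray(theta_2, dtype=float)
    return p * theta_1 + (1.0 - p) * theta_2


# Toy example: two levels, first teacher always picks level 0,
# second teacher picks uniformly at random.
bounds = approx_ne_bounds(B=1.0, p=0.25)
mixed = mix_teachers([1.0, 0.0], [0.5, 0.5], p=0.25)
```

Note that the gap for the first teacher, 2B(1 - p), shrinks as p approaches 1, that is, as the second teacher is played less often.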
0e915db6326b6fb6a3c56546980a8c93-Supplemental.pdf
Let $B$ be the maximum difference between $U_t^1$ and $U_t^2$, and let $(\pi, \theta^1, \theta^2)$ be a Nash equilibrium for $G$. Let $\pi^1$ be the best response to the first teacher (with utility $U_t^1$) and let $\pi^{1+2}$ be the best-response policy to the joint teacher. This result shows that as we reduce the number of random episodes, the approximation to a minimax regret strategy improves. Let $G$ be the dual curriculum game in which the first teacher maximizes regret, so $U_t^1 = U_t^R$, and the second teacher plays randomly, so $U_t^2 = U_t^U$. Finally, we need to show that $\pi^{2+3}$ is optimal for the student.